Abstract



In this paper, I expand Shannon's definition of entropy into a new form of entropy that 
allows integration of information from different random events. Shannon's notion of en- 
tropy is a special case of my more general definition of entropy. I define probability using 
a so-called performance function, which is de facto an exponential distribution. Assuming
that my general notion of entropy reflects the true uncertainty about a probabilistic
event, I understand that our perceived uncertainty differs. I claim that our perception is
the result of two opposing forces similar to the two famous antagonists in Chinese philosophy:
Yin and Yang. Based on this idea, I show that our perceived uncertainty matches
the true uncertainty in points determined by the golden ratio. I demonstrate that the

well-known sigmoid function, which we typically employ in artificial neural networks as a 
non-linear threshold function, describes the actual performance. Furthermore, I provide 
a motivation for the time dilation in Einstein's Special Relativity, basically claiming that 
although time dilation conforms with our perception, it does not correspond to reality. 
At the end of the paper, I show how to apply this theoretical framework to practical 
applications. I present recognition rates for a pattern recognition problem, and also pro- 
pose a network architecture that can take advantage of general entropy to solve complex 
decision problems. 

Keywords: Information Theory, Entropy, Sensor Fusion, Machine Learning, 
Perception, Special Relativity. 
1 Introduction



Uncertainty is our constant companion in everyday decision making. Being able to
deal with uncertainty is thus an essential requirement for intelligent behavior in real-world
environments. Naturally, knowing the exact amount of uncertainty involved in
a particular decision is very useful information to have. Mathematically, the classic
way of measuring the uncertainty for a random event is to compute its information 
based on the definition of entropy introduced by Shannon [15]. In this paper, however, 
I introduce a new, general form of entropy that is motivated by my earlier work on 
classifier combination [3, 4, 6]. The idea of classifier combination, or sensor fusion in 
general, is to combine the outcomes of different sub-optimal processes into one integrated 
result. Ideally, the integrated process performs better in the given application domain 
than each individual process alone. In order to integrate different processes into a single 
process, computers need to deal with the uncertainties involved in the outcomes of each 
individual process. For classifier combination, several combination schemes have already 
been suggested. The field, however, has not reached a final verdict on
this issue yet. In my earlier work, I proposed an information-theoretical approach to
this problem. The main idea of this approach is to normalize confidence values in such 
a way that their nominal values match their conveyed information, which I measure on 
a training set in the application domain. The overall combined confidence for each class 
is then simply the sum of the normalized confidence values of each individual classifier. 
In this paper, I am going to elaborate on my earlier ideas by looking at them from the 
general entropy's point of view. 

I structured the paper as follows: Following this introduction, Section 2 repeats the
definition of entropy as introduced by Shannon, and compares it to my new and more 
general definition. Section 3 provides a short introduction into my earlier work on infor- 
mational confidence and repeats the main postulates and their immediate consequences. 
Section 4 describes how I understand confidence as the result of an interplay of two 
opposing forces. In Section 5, this insight will show the sigmoid function of classic back- 
propagation networks in a different light, namely as a kind of mediator between these 
two forces. A closer inspection in Section 6 reveals that the net effect of both opposing 
forces equals one single force in points defined by the golden ratio. In Section 7, I relate 
the introduced forces to the well-known forces of Yin and Yang in Chinese philosophy. 
In particular, I show how we can derive the typical Yin-Yang symbol from the assumptions
made. In Section 8, I explore common grounds of the general framework presented
here and Einstein's Special Relativity. I provide an interesting motivation for the time 
dilation in Einstein's Special Relativity. Section 9 is then going to show how we can 
learn informational confidence values, illustrating the learning process with a practical 
example of handwritten Japanese character recognition. This section also proposes a 
network architecture for learning based on the ideas introduced in the previous sections. 
Finally, a summary with the main results concludes the paper. 
2 Entropy



Entropy is a measure for the uncertainty in a random event or signal. Alternatively, we 
can understand entropy as the amount of information conveyed by the random event or 
carried by the signal. Entropy is a general concept that has applications in statistical 
mechanics, thermodynamics, and of course information theory in computer science. The 
latter will be the focus of my attention in the following. At the end of my paper, I will 
present an interesting connection with Einstein's Special Relativity and physics, though. 

Claude E. Shannon introduced entropy as a measure for randomness in his 1948
seminal paper "A Mathematical Theory of Communication." For a discrete random
event with n possible outcomes, Shannon defines the entropy H as the sum of expected
information K_i for each outcome i:

    H = \sum_{i=1}^{n} K_i    (1)

Shannon uses the negative logarithm to compute information itself. In this way, he can
simply add the information of two independent outcomes to get the combined information
of both. Accordingly, each K_i in (1) reads as follows:

    K_i = -p(i) \ln(p(i)),    (2)

with p(i) denoting the probability of the i-th outcome. The entropy reaches a maximum
when all p(i) are equal, which indicates maximum uncertainty. On the other hand, the
entropy is minimal, i.e. zero, if exactly one p(i) is 1 and all other outcomes have a
probability of zero.
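
As a small numerical illustration of (1) and (2), the following sketch computes Shannon's entropy in nats for three example distributions. The function name `shannon_entropy` and the example probabilities are assumptions made for illustration only; the convention that outcomes with probability zero contribute nothing is the usual one.

```python
import math

def shannon_entropy(probs):
    """Entropy H = sum_i -p(i) * ln(p(i)) in nats; outcomes with p(i) = 0 contribute 0."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)

# Illustrative distributions over n = 4 outcomes (hypothetical values).
uniform = [0.25, 0.25, 0.25, 0.25]   # all outcomes equally likely
skewed = [0.7, 0.1, 0.1, 0.1]        # one outcome dominates
certain = [1.0, 0.0, 0.0, 0.0]       # exactly one outcome is possible

h_uniform = shannon_entropy(uniform)
h_skewed = shannon_entropy(skewed)
h_certain = shannon_entropy(certain)

print(f"uniform: {h_uniform:.4f}  skewed: {h_skewed:.4f}  certain: {h_certain:.4f}")
```

The uniform case attains the maximum ln(4), and the certain case yields zero, matching the two extremes described above.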

I now introduce the following more general variant that I will be using instead
of (2) to compute the entropy H:

    K_i = -p(K_i) \ln(p(K_i))    (3)

In this new form, the expected information for each outcome appears on both sides
of the equation, effectively making (3) a fixed point equation. Instead of using the
probability p(i) of an outcome, I am now using the probability of the outcome's specific
information. I also do not require the sum of all probabilities p(K_i) to be one.

A straightforward comparison of (2) and (3) shows that Shannon's definition of entropy
and its more general variant are the same if each outcome satisfies the following
equation:

    p(i) = p(K_i)    (4)

In other words, both definitions of entropy are the same when the probability of each
outcome matches the probability of its information, which we can consider to be a fixed
point.

The next section gives a motivation for the general entropy formula using pattern 
recognition, and in particular classifier combination, as a practical example. 
3 Informational Confidence

Pattern recognition is a research field in computer science dealing with the automatic 
classification of pattern samples into different classes. Depending on the application 
domain, typical classes are, e.g., characters, gestures, traffic signs, faces, etc. For a given
unknown test pattern, most classifiers return both the actual classification result in the form
of a ranked list of class labels, and corresponding values indicating the confidence of
the classifier in each class label. I will be using the term "confidence value" for these
values throughout the paper, but I should mention that other researchers may prefer
different terms, such as "score" or "likelihood." In practical classifier systems, confidence
values are usually only rough approximations of their mathematically correct values. In
particular, they very often do not meet the requirements of probabilities. While this
usually does not hamper the operation of a single classifier, which only depends on the
relative proportion of confidence values, it causes problems in multiple classifier systems, 
which need the proper values for combination purposes. Post-processing steps, such as 
linguistic context analysis for character recognition, can also benefit from more accurate 
confidence values. 

Combination of different classifiers in a multiple classifier system has turned out to
be a powerful tool for reducing the uncertainty involved in a classification problem [8].
Researchers have shown in numerous experiments that combined
classifiers can outperform each single classifier. Nevertheless, researchers
are still undecided about how to best integrate the confidence values of each
individual classifier into one single confidence. In earlier work, I proposed so-called informational
confidence values as a way to combine multiple confidence values [3, 4, 6].
The idea of informational confidence values is to introduce a standard of comparison 
allowing fair comparison and easy integration of confidence values generated by differ- 
ent classifiers. The definition of informational confidence values relies on two central 
postulates: 

1 Confidence is information 

2 Information depends on performance 

The first postulate states that each confidence value conveys information, and it con- 
sequently requires that the nominal value of each confidence value should equal the 
information conveyed. The second postulate then logically continues by requiring that 
the amount of information conveyed should depend on the performance of the confidence 
value in the application domain. From both postulates taken together, I can conclude that
confidence depends on performance via information. To formalize these requirements, let
me assume that each classifier C can output confidence values from a set of confidence
values K_C, with

    K_C = \{K_0, K_1, \ldots, K_i, \ldots, K_N\}    (5)

Let me further assume that K_N indicates the highest confidence classifier C can output.
The following fixed point equation then defines a linear relationship between confidence
and information, with the latter depending on the performance complement of each
confidence value.

    K_i = E * I(\overline{p(K_i)}) + C    (6)

We see that the confidence values K_i appear on both sides of Equation (6), essentially
making it a fixed point equation with the so-called informational confidence values as 
fixed points. Using the performance complement ensures that higher confidence values 
with better performance convey more information than lower confidence values when we 
apply Claude Shannon's logarithmic notion of information [15]. According to Shannon, 
information of a probabilistic event is the negative logarithm of its probability. More 
information on Shannon's work and the implications of his strikingly simple definition of 
information can be found in [13, 14, 16]. 

By setting constant C to zero, inserting the negative logarithm as information function
I, and using 1 - p(K_i) as performance complement, I simplify Equation (6) to the
following definition of informational confidence:

    K_i = -E * \ln(1 - p(K_i))    (7)

The still unknown parameters necessary to compute informational confidence values
according to (7) are E and p(K_i). A straightforward transformation of (7) sheds more
light on these two parameters:

    K_i = -E * \ln(1 - p(K_i))
    \Leftrightarrow e^{-K_i/E} = 1 - p(K_i)
    \Leftrightarrow p(K_i) = 1 - e^{-K_i/E}    (8)



The result shows that the performance function p(K_i) describes an exponential distribution
with expectation value E. This follows from the general definition of an exponential
density function e_\lambda(x) with parameter \lambda:

    e_\lambda(x) = \begin{cases} \lambda * e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}, \quad \lambda > 0    (9)

For each \lambda, the enclosed area of the density function equals 1:

    \int_{-\infty}^{\infty} e_\lambda(x) \, dx = \int_{0}^{\infty} \lambda * e^{-\lambda x} \, dx = 1 \quad \forall \lambda > 0    (10)

Figure 1 shows three different exponential densities differing in their parameter \lambda, with
\lambda = 100, \lambda = 20, and \lambda = 10 respectively. The parameter \lambda has a direct influence on
the steepness of the exponential density function. The higher \lambda, the steeper the density
function.

The corresponding distribution E_\lambda(k), which describes the probability that the random
variable assumes values lower than or equal to a given value k, computes as follows:

    E_\lambda(k) = \int_{-\infty}^{k} e_\lambda(x) \, dx
                 = \int_{0}^{k} \lambda * e^{-\lambda x} \, dx
                 = \left[ -e^{-\lambda x} \right]_{0}^{k}
                 = 1 - e^{-\lambda k}    (11)

Figure 1: Exponential density for \lambda = 100, \lambda = 20, and \lambda = 10.

Figure 2: Exponential distribution for \lambda = 100, \lambda = 20, and \lambda = 10.



Figure 2 shows the distributions for the three different densities depicted in Figure 1,
with \lambda = 100, \lambda = 20, and \lambda = 10. The parameter \lambda again influences the steepness: A
larger \lambda entails a steeper distribution. For each parameter \lambda, the distribution function
converges to 1 with increasing confidence. Another important feature is the relationship
between parameter \lambda and the expectation value E of the exponentially distributed
random variable. Both are in inverse proportion to each other, with E = 1/\lambda. Accordingly,
the expectation values corresponding to the exponential densities in Figure 1, and
distributions in Figure 2, are E = 1/100, E = 1/20, and E = 1/10 respectively.
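
The properties of the exponential density and distribution stated above can be checked numerically. The following is a minimal sketch, not part of the original derivation: the helper names and the simple midpoint-rule integrator are assumptions chosen to keep the example self-contained. It approximates the enclosed area and the expectation value for the three parameters used in Figures 1 and 2.

```python
import math

def exp_density(x, lam):
    """Exponential density: lam * exp(-lam * x) for x >= 0, else 0."""
    return lam * math.exp(-lam * x) if x >= 0.0 else 0.0

def exp_distribution(k, lam):
    """Closed-form distribution function 1 - exp(-lam * k) for k >= 0, else 0."""
    return 1.0 - math.exp(-lam * k) if k >= 0.0 else 0.0

def integrate(f, lower, upper, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(f(lower + (i + 0.5) * width) for i in range(steps))

results = {}
for lam in (100.0, 20.0, 10.0):
    upper = 40.0 / lam  # the tail beyond this point is negligible (about exp(-40))
    area = integrate(lambda x: exp_density(x, lam), 0.0, upper)
    mean = integrate(lambda x: x * exp_density(x, lam), 0.0, upper)
    results[lam] = (area, mean)
    print(f"lambda = {lam:5.1f}: area = {area:.6f}, mean = {mean:.6f}, 1/lambda = {1.0 / lam:.6f}")
```

Up to the integration error, the area is 1 for each parameter and the mean agrees with E = 1/\lambda.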

When we compare the performance specification in (8) with the exponential distribution
in (11), we see that the only difference lies in the exponent of the exponential
function. In fact, performance function and exponential distribution become identical for
\lambda = 1/E. This result shows that the performance function p(K_i) describes the distribution
of exponentially distributed confidence values with expectation E. We can therefore consider
confidence as an exponentially distributed random variable with parameter \lambda = 1/E.
The performance theorem summarizes this important result:

Performance Theorem:

A classifier C with performance p(K) provides informational confidence K = -E *
\ln(1 - p(K)) if, and only if, p(K) is an exponential distribution with expectation E.

The performance theorem explains the meaning and implications of the parameters E
and p(K). For classifiers violating the performance theorem, the equation stated in the
performance theorem allows us to compute the proper informational confidence values as
long as we know the specific values of E and p(K). Section 9 will later show how we can
estimate these parameters on a given evaluation set.
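
The two directions of the performance theorem can be illustrated with a short round-trip check. This is a sketch under stated assumptions: the function names and the value E = 0.5 are hypothetical, and the performance values are illustrative rather than measured on any evaluation set.

```python
import math

def informational_confidence(p, expectation):
    """K = -E * ln(1 - p) for a performance 0 <= p < 1."""
    return -expectation * math.log(1.0 - p)

def performance(k, expectation):
    """p(K) = 1 - exp(-K / E), the exponential distribution function with mean E."""
    return 1.0 - math.exp(-k / expectation)

E = 0.5  # hypothetical expectation value
for p in (0.1, 0.5, 0.9, 0.99):
    k = informational_confidence(p, E)
    # Mapping K back through the performance function recovers p.
    print(f"p = {p:.2f} -> K = {k:.4f} -> p(K) = {performance(k, E):.2f}")
```

Higher performance maps to higher informational confidence, and applying the exponential distribution function to the result returns the original performance.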

In the next section, I take the idea of informational confidence a step further and intro- 
duce a second type of confidence called counter-confidence, which describes the confidence 
of the classifier in the falseness of its output. The subsequent sections then elaborate on 
this concept and present new theoretical results and discuss their implications. 

4 Opposing Forces 

I am assuming that decision making is based on two opposing forces, one supporting 
a certain outcome and one arguing against it. In particular, I am going to propose a 
formalization of both forces, which I name Force A and Force B for the time being, 
based on the fixed point equation of the performance theorem. In fact, I postulate that 
Force A is already defined by this equation. Force B only differs in its interpretation of 
performance. 

4.1 Force A 

The first force, Force A, describes the confidence in a particular decision. Accordingly,
I use the fixed point equation of informational confidence values as the definition of
Force A:

    K = -E * \ln(1 - p(K))    (12)

The left-hand side of this equation denotes the magnitude of Force A. It is the product
of information in the Shannon sense and an expectation value in the statistical sense. As
shown above, the performance function p(K) follows immediately as p(K) = 1 - e^{-K/E}.
If the performance in the logarithmic expression on the right-hand side of (12) is 1, and
the expectation E is positive, then Force A becomes infinite. On the other hand, if the
performance is zero, then the logarithm becomes zero and there is no Force A at all.

4.2 Force B 

The second force, Force B, is defined similarly but acts in a complementary way to Force A.
Force B describes information that depends directly on the performance and not on
the performance complement. Accordingly, the following modified fixed point equation
describes Force B:

    K = -E * \ln(p(K))    (13)

The difference to Force A lies in the interpretation of the performance function p(K),
which follows again from a straightforward transformation:

    K = -E * \ln(p(K))
    \Leftrightarrow p(K) = e^{-K/E}    (14)

We see that the performance function of Force B is similar to the performance of Force A.
However, it looks at the problem from a different side. Instead of describing the area
delimited by K under the exponential density curve, it describes the remaining area that
is not delimited. Parameter E is again a statistical expectation value. Unlike Force A,
Force B becomes infinite for a performance equal to zero and positive expectation. It
becomes zero whenever the performance is perfect, i.e. p(K) = 1. While Force A defines
informational confidence values, Force B can be considered as defining informational
counter-confidence values.
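
The complementary behavior of the two forces can be sketched in a few lines. The example assumes the readings K = -E * ln(1 - p(K)) for Force A and K = -E * ln(p(K)) for Force B, with the corresponding performance functions 1 - e^{-K/E} and e^{-K/E}; the function names and the value E = 1 are hypothetical.

```python
import math

def force_a(p, expectation):
    """Force A (confidence): K = -E * ln(1 - p)."""
    return -expectation * math.log(1.0 - p)

def force_b(p, expectation):
    """Force B (counter-confidence): K = -E * ln(p)."""
    return -expectation * math.log(p)

def performance_a(k, expectation):
    """Area under the exponential density up to K: 1 - exp(-K / E)."""
    return 1.0 - math.exp(-k / expectation)

def performance_b(k, expectation):
    """Remaining area beyond K: exp(-K / E)."""
    return math.exp(-k / expectation)

E = 1.0  # hypothetical expectation value
for p in (0.1, 0.5, 0.9):
    print(f"p = {p:.1f}: Force A = {force_a(p, E):.4f}, Force B = {force_b(p, E):.4f}")

# For any K, the two interpretations of performance add up to the whole area.
k = 0.7
print(performance_a(k, E) + performance_b(k, E))
```

Force A grows as the performance approaches 1, Force B grows as it approaches 0, and both coincide at a performance of 0.5.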

4.3 Interplay of Forces 

Having defined both Force A and Force B, I postulate that all decision processes are the
result of the interplay between these two forces. What we can actually experience when
making decisions is the dominance one of these forces has achieved over its counterpart.
Mathematically, I understand that this dominance is the net effect of both forces and
thus use the difference between the defining equations in (12) and (13) to describe it:

    K = -E * \ln\left(\frac{1 - p(K)}{p(K)}\right)    (15)

This equation is a fixed point equation itself. It describes the net force, which is the
result of both forces acting simultaneously. Naturally, the net force becomes zero when
Force A equals Force B. This is the case when either the expectation value is zero or
the performance p(K) is 0.5. The net force becomes either infinity or minus infinity
when one force dominates completely over its counterpart. In particular, the net force
becomes infinity when Force A dominates with p(K) = 1 and minus infinity when Force B
dominates with p(K) = 0.
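
The behavior of the net force can be sketched numerically. The example assumes that the net force is Force A minus Force B, i.e. -E * ln(1 - p) - (-E * ln(p)) = -E * ln((1 - p) / p); the function name and the value E = 1 are hypothetical.

```python
import math

def net_force(p, expectation):
    """Net force: -E * ln((1 - p) / p) for a performance 0 < p < 1."""
    return -expectation * math.log((1.0 - p) / p)

E = 1.0  # hypothetical expectation value
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"p = {p:.2f}: net force = {net_force(p, E):+.4f}")

# The difference of the two individual forces gives the same value.
p = 0.8
difference = -E * math.log(1.0 - p) - (-E * math.log(p))
print(abs(difference - net_force(p, E)) < 1e-12)
```

The net force vanishes at a performance of 0.5, is positive above it, negative below it, and antisymmetric under the exchange of p and 1 - p.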

The following two sections are going to present two more interesting theoretical re- 
sults, which are a direct consequence of the net force defined by (15), namely the sigmoid 
function and the golden ratio. Section 7 will later relate Force A and Force B to the 
well-known antagonistic forces in Chinese philosophy: Yin and Yang. 

5 Sigmoid Function 

A closer look at the net force defined in (15) reveals that the performance function is
indeed a well-known function. A straightforward derivation leads to the following result:

    K = -E * \ln\left(\frac{1 - p(K)}{p(K)}\right)
    \Leftrightarrow e^{-K/E} = \frac{1}{p(K)} - 1
    \Leftrightarrow p(K) = \frac{1}{1 + e^{-K/E}}    (16)

Figure 3: Sigmoid function.



It shows that the performance function is actually identical to the type of sigmoid function 
that classical feedforward network architectures very often use as threshold function. The 
traditional explanation for the use of this particular threshold function has always lain 
in its features of non-linearity and simplicity. Non-linearity increases the expressiveness 
of a neural network, allowing decision boundaries in feature space that a simple linear 
network would not be able to model. A neural network with only linear output functions 
would simply collapse into a single linear function, which cannot model complex decision 
boundaries. The other advantage of the sigmoid function in (16) is the simplicity of its
derivative, which facilitates the backpropagation of errors during the training of neural
networks. While these are surely important points, it now seems that the deeper meaning 
of the sigmoid function has more of an information-theoretical nature, as motivated 
above. 

Figure 3 shows the sigmoid function in (16) for four different values of the parameter E,
among them E = 1 and E = 2. As its name already suggests, the sigmoid function has
an S-shape. It converges to 0 towards negative infinity and to 1 towards infinity. The
parameter E controls the steepness of the sigmoid function. For smaller values of E, the
sigmoid function becomes steeper and approaches 0 or 1 faster on either end.
Independent of E, the sigmoid function is always 0.5 for K = 0.
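
The relationship between the net force and the sigmoid function can be verified with a short sketch. It assumes the sigmoid form p(K) = 1 / (1 + e^{-K/E}) and the net force K = -E * ln((1 - p) / p); the function names and parameter values are hypothetical. The sketch also checks the simple closed form of the slope, p * (1 - p) / E, against a finite difference.

```python
import math

def sigmoid(k, expectation):
    """Performance as a function of the net force: 1 / (1 + exp(-K / E))."""
    return 1.0 / (1.0 + math.exp(-k / expectation))

def net_force(p, expectation):
    """Inverse of the sigmoid: K = -E * ln((1 - p) / p)."""
    return -expectation * math.log((1.0 - p) / p)

def sigmoid_slope(k, expectation):
    """Derivative of the sigmoid with respect to K: p * (1 - p) / E."""
    p = sigmoid(k, expectation)
    return p * (1.0 - p) / expectation

for E in (0.5, 1.0, 2.0):  # hypothetical parameter values
    print(f"E = {E}: p(0) = {sigmoid(0.0, E)}, p(1) = {sigmoid(1.0, E):.4f}")

# Finite-difference check of the slope at K = 0.3 for E = 1.
h = 1e-5
numeric = (sigmoid(0.3 + h, 1.0) - sigmoid(0.3 - h, 1.0)) / (2.0 * h)
print(abs(numeric - sigmoid_slope(0.3, 1.0)) < 1e-6)
```

Smaller values of E give larger values at K = 1, which reflects the steeper curve, while the value at K = 0 stays at 0.5.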

6 The Golden Ratio 

I now assume that the performance of a given confidence value K always matches exactly
the expectation, i.e. E = p(K). Note that this corresponds to the form
of the summands of the general entropy in Section 2. The net force equation in (15) will
then read as follows:

    K = -p(K) * \ln\left(\frac{1 - p(K)}{p(K)}\right)    (17)

Figure 4 depicts the net force in (17) graphically for performance values p(K) ranging
from 0 to 1. As we can see in Figure 4, the net force becomes zero for p(K) = 0 and
p(K) = 0.5. For performances higher than 0.5 and approaching 1, the net force diverges
to infinity. Figure 4 also shows a mirrored variant of the net force, namely

    K = -(1 - p(K)) * \ln\left(\frac{1 - p(K)}{p(K)}\right)    (18)

This equation is a direct result of (17) after changing the sign and replacing the performance
p(K) with its complement 1 - p(K). The net force and its mirrored variant both
meet at p(K) = 0.5. We can actually consider p(K) = 0.5 as a transition point where the
net force transforms into its mirrored variant. After the transition, we are still looking
at the same problem. However, our point of view has changed and is now reflected by
the mirrored net force. This will become important later in Section 7, where we relate
these forces to Yin and Yang.

For the time being, let us concentrate on the net force in (17). The net force and
the counter-confidence (Force B) in (13), with E = p(K), become equal when the performance
p(K) satisfies the following relationship:

    p(K) = \frac{1 - p(K)}{p(K)}    (19)
    \Leftrightarrow p(K)^2 + p(K) - 1 = 0
    \Leftrightarrow p(K) = \frac{-1 \pm \sqrt{5}}{2}
    \Leftrightarrow p(K) \approx 0.618 \lor p(K) \approx -1.618    (20)

Figure 5: Golden Ratio. A line of length a+b is divided into segments a and b such that a+b is to a as a is to b.

This transformation shows that counter-confidence and net force are the same for a
performance of about 0.618, when just considering the positive performance value. Interestingly,
this transformation also shows that the two possible values satisfying (19),
namely approximately -1.618 and 0.618, are precisely the negative values of the so-called golden
ratio. Force B thus equals the compound effect of Force A and Force B for performances
defined by the golden ratio.
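
The golden-ratio result can be checked numerically. The sketch below assumes E = p(K), Force B as -p * ln(p), and the net force as -p * ln((1 - p) / p); the function and variable names are hypothetical. It solves the quadratic equation p^2 + p - 1 = 0 and compares both forces at the positive root.

```python
import math

# Roots of p^2 + p - 1 = 0.
p_pos = (-1.0 + math.sqrt(5.0)) / 2.0   # approximately 0.618
p_neg = (-1.0 - math.sqrt(5.0)) / 2.0   # approximately -1.618

def force_b(p):
    """Force B with E = p(K): -p * ln(p)."""
    return -p * math.log(p)

def net_force(p):
    """Net force with E = p(K): -p * ln((1 - p) / p)."""
    return -p * math.log((1.0 - p) / p)

print(f"roots: {p_pos:.6f}, {p_neg:.6f}")
print(f"Force B at positive root:   {force_b(p_pos):.6f}")
print(f"net force at positive root: {net_force(p_pos):.6f}")

# The usual golden ratio values are the negatives of both roots.
phi = (1.0 + math.sqrt(5.0)) / 2.0
print(abs(-p_neg - phi) < 1e-12, abs(-p_pos - (1.0 - phi)) < 1e-12)
```

Only the positive root lies in the range of valid performance values, so the logarithmic expressions are evaluated there alone.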

With a detailed introduction into the golden ratio being out of scope, I provide 
only some background information about the golden ratio, or golden mean as it is also 
called [10, 2]. The golden ratio is an irrational number, or rather two numbers, describing
the proportion of two quantities. Expressed in words, two quantities are in the golden 
ratio to each other, if the whole is to the larger part as the larger part is to the smaller part. 
The whole in this case is simply the sum of both parts. Figure 5 shows an example of a 
line divided into two segments that are in the golden ratio to each other. Historically, the 
golden ratio was already studied by ancient mathematicians. It plays an important role 
in different fields like geometry, biology, physics, and others. Many artists and designers 
deliberately or unconsciously make use of it because it seems that artwork based on the 
golden ratio has an esthetic appeal, and features some kind of natural symmetry. Despite 
the fact that the golden mean is of paramount importance to so many fields, I think it 
is fair to say that we still do not have a full, or rather correct, understanding of its true 
meaning in science. Mathematically, the golden mean can be derived from the following
equation, which expresses the verbal description given above in mathematical terms:

    \frac{a + b}{a} = \frac{a}{b}    (21)

Accordingly, the golden mean, which is typically denoted by the Greek letter \varphi, is then
given by the ratio of a and b, i.e. \varphi = a/b. Using the relationship in (21), the golden
ratio \varphi can be resolved into two possible values:

    \varphi = \frac{1 + \sqrt{5}}{2} \lor \varphi = \frac{1 - \sqrt{5}}{2}    (22)

    \Leftrightarrow \varphi \approx 1.618 \lor \varphi \approx -0.618    (23)

Usually, the positive value (approximately 1.618) is identified with \varphi. Note that these values are the
same as in (20), except that their signs are reversed. The reader interested in a thorough
analysis of the golden mean can find more information and many practical examples in
the references [10, 2].
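
For completeness, the intermediate steps from the proportion (a + b)/a = a/b to the two values of the golden ratio can be written out as follows. This is a standard derivation added as a sketch; it assumes b is nonzero and uses the symbol \varphi = a/b. The last line states the link to the quadratic equation for the performance via the substitution p = -\varphi.

```latex
\begin{align*}
\frac{a+b}{a} = \frac{a}{b}
  &\;\Longleftrightarrow\; 1 + \frac{b}{a} = \frac{a}{b} \\
  &\;\Longleftrightarrow\; 1 + \frac{1}{\varphi} = \varphi \\
  &\;\Longleftrightarrow\; \varphi^2 - \varphi - 1 = 0 \\
  &\;\Longleftrightarrow\; \varphi = \frac{1 \pm \sqrt{5}}{2} \\[4pt]
p = -\varphi
  &\;\Longrightarrow\; p^2 + p - 1 = \varphi^2 - \varphi - 1 = 0
\end{align*}
```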
7 Yin and Yang



I will now relate the above theoretical results to one of the oldest philosophical world
views, namely the principle of Yin and Yang. In particular, I dare to advance the
hypothesis that both Force A and Force B, which I defined respectively in (12) and (13)
using fixed point equations, correspond to the two opposing forces Yin and Yang when
we assume that expectation always equals performance, i.e. E = p(K). If this can indeed
be confirmed by further observations, this ancient philosophical concept could play an
important role in computer science. In fact, I will provide further evidence of this claim
and also show how we can use the concept of Yin and Yang for machine learning. Let
me begin with a short summary of the Yin/Yang concept in Chinese philosophy.

7.1 Philosophy 

The concept of Yin and Yang is deeply rooted in Chinese philosophy [23]. Its origin dates 
back at least 2500 years, probably much earlier, playing a crucial role in the oldest Chinese 
philosophical texts. Chinese philosophy has attached great importance to Yin/Yang
ever since. Today, the idea of Yin/Yang pervades fields as different as religion, sports,
medicine, politics, and many more. The fact that the Korean national flag sports a
Yin/Yang symbol illustrates the emphasis laid on this concept in Asian countries.

Yin and Yang stand for two principles that are opposites of each other, and which are
constantly trying to gain the upper hand over each other. However, neither one will ever 
succeed in doing so, though one principle may temporarily dominate the other one. Both 
principles cannot exist without each other. It is rather the constant struggle between 
both principles that defines our world and produces the rhythm of life. According to 
Chinese philosophy, Yin and Yang are the foundation of our entire universe. They flow 
through, and thus affect, every being. Typical examples of Yin/Yang opposites are, for 
example, night/day, cold/hot, rest/activity, etc. 

Chinese philosophy does not confine itself to a mere description of Yin and Yang. It
also provides guidelines on how to live in accordance with Yin and Yang. The central 
statement is that Yin and Yang need to be in harmony. Any imbalance of an economical, 
biological, physical, or chemical system can be directly attributed to a distorted equilib- 
rium between Yin and Yang. For instance, an illness accompanied by fever is the result 
of Yang being too strong and dominating Yin. On the other hand, dominance of Yin 
could result, for instance, in a body shivering with cold. The optimal state every being, 
or system, should strive for is therefore the state of equilibrium between Yin and Yang. 
It is this state of equilibrium between Yin and Yang that Chinese philosophy considers 
the most powerful and stable state a system can assume. 

Yin and Yang can be further subdivided into Yin and Yang. For instance, "cold" can 
be further divided into "cool" or "chilly," and "hot" into "warm" or "boiling." Yin and 
Yang already carry the seed of their opposites: A dominating Yin becomes susceptible 
to Yang and will eventually turn into its opposite. On the other hand, a dominating 
Yang gives rise to Yin and will thus turn into Yin over time. This defines the perennial
alternating cycle of Yin or Yang dominance. Only the equilibrium between Yin and Yang 
is able to overcome this cycle. 
Figure 6: Yin and Yang.

Figure 7: Logarithmic spiral.

7.2 Logarithmic Spirals 

Figure 6 depicts the well-known black and white symbol of Yin and Yang. The dots 
of different color in the area delimited by each force symbolize the fact that each force
bears the seed of its counterpart within itself. According to the principle of Yin and
Yang outlined above, neither Yin nor Yang can be observed directly. Both Yin and
Yang are intertwined forces always occurring in pairs, rather than being isolated forces 
independent from each other. In Chinese philosophy, Yin and Yang assume the form of 
spirals. I will now show that the net force in (17) is a spiral too. In order to do so, I will 
first introduce the general definition of the logarithmic spiral before I then illustrate the 
similarity to the famous Yin/Yang symbol. 

A logarithmic spiral is a special type of spiral curve, which plays an important role 
in nature. It occurs in all different kinds of objects and processes, such as mollusk shells,
hurricanes, galaxies, and many more [1]. In polar coordinates (r, \theta), the general definition
of a logarithmic spiral is

    r = a e^{b\theta}    (24)

Parameter a is a scale factor determining the size of the spiral, while parameter b con- 
trols the direction and tightness of the wrapping. For a logarithmic spiral, the distances 
between the turnings increase. This distinguishes the logarithmic spiral from the Archi- 
median spiral, which features constant distances between turnings. Figure 7 depicts a 
typical example of a logarithmic spiral. Resolving (24) for \theta leads to the following general
form of logarithmic spirals:

    \theta = \frac{1}{b} \ln\left(\frac{r}{a}\right)    (25)

Figure 8: Yin-Yang Spirals (panel title: Yin/Yang Model).
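
The two forms of the logarithmic spiral, r = a * e^{b * theta} and theta = (1/b) * ln(r/a), can be illustrated with a small round-trip sketch. The function names and the parameter values a = 1 and b = 0.1 are hypothetical. The sketch also shows that the distance between successive turnings grows, which is what distinguishes the logarithmic spiral from the Archimedean spiral.

```python
import math

def spiral_radius(theta, a, b):
    """Radius of a logarithmic spiral: r = a * exp(b * theta)."""
    return a * math.exp(b * theta)

def spiral_angle(r, a, b):
    """Angle recovered from the radius: theta = (1 / b) * ln(r / a)."""
    return math.log(r / a) / b

a, b = 1.0, 0.1  # hypothetical scale and wrapping parameters

# Radii after 0, 1, 2, and 3 full turns, and the gaps between successive turnings.
radii = [spiral_radius(2.0 * math.pi * n, a, b) for n in range(4)]
gaps = [radii[i + 1] - radii[i] for i in range(3)]
print("radii:", [round(r, 4) for r in radii])
print("gaps: ", [round(g, 4) for g in gaps])

# Round trip: converting a radius back to its angle recovers the original angle.
theta = 2.5
print(abs(spiral_angle(spiral_radius(theta, a, b), a, b) - theta) < 1e-12)
```

With a positive b the radius grows by the constant factor e^{2 * pi * b} per turn, so the gaps increase from turn to turn.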

In order to show that the net force in (17) defines a logarithmic spiral, and for the
sake of easier illustration, I investigate the negative version of the net force in (17) and
look at the polar coordinates (r, \theta) it defines, namely:

    \theta = -p(K) * \ln\left(\frac{p(K)}{1 - p(K)}\right) \quad \text{and} \quad r = (1 - p(K)) * e^{-\theta / p(K)}    (26)

A comparison of (26) with the general form of logarithmic spirals in (25) shows that
the net force does indeed describe a spiral. Both (25) and (26) match when we set the
parameters a and b to the following values:

    a = 1 - p(K) \quad \text{and} \quad b = -\frac{1}{p(K)}    (27)

In particular, we can check that a and b are identical when p(K) equals the golden ratio.
If we let p(K) run from 0 to 1, and mirror the resulting spiral along both axes similar
to Figure 4, we obtain two spirals. Figure 8 shows both spirals plotted in a Cartesian
coordinate system. Both spirals are, of course, symmetrical and their turnings approach
the unit circle. A comparison of the Yin/Yang symbol of Figure 6 with the spirals in
Figure 8 shows the strong similarities between both figures. A simple mirror operation
transforms the spirals in Figure 8 into the Yin/Yang symbol.
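
The spiral construction can be sketched numerically under explicit assumptions. The sketch assumes the angle theta = -p * ln(p / (1 - p)), the scale factor a = 1 - p, and the wrapping parameter b = -1/p; the last two are one reading of the parameter values that makes a and b coincide at the golden ratio, and they are assumptions of this example rather than confirmed definitions. Function and variable names are hypothetical.

```python
import math

def spiral_point(p):
    """Polar and Cartesian coordinates for a performance value 0 < p < 1."""
    theta = -p * math.log(p / (1.0 - p))   # negative net force with E = p
    a = 1.0 - p                            # assumed scale factor
    b = -1.0 / p                           # assumed wrapping parameter
    r = a * math.exp(b * theta)            # logarithmic spiral r = a * e^(b * theta)
    return theta, r, r * math.cos(theta), r * math.sin(theta)

# Sample the spiral for p = 0.01, 0.02, ..., 0.99.
points = [spiral_point(i / 100.0) for i in range(1, 100)]

# Mirroring along both axes is a point reflection through the origin.
mirrored = [(-x, -y) for (_, _, x, y) in points]

# Under these assumptions, a and b coincide at the golden ratio.
phi = (1.0 + math.sqrt(5.0)) / 2.0
a_at_phi = 1.0 - phi
b_at_phi = -1.0 / phi
print(f"a = {a_at_phi:.6f}, b = {b_at_phi:.6f}")
print("largest radius:", max(r for (_, r, _, _) in points))
```

With these choices the radius reduces to the performance itself, so all sampled points stay inside the unit circle and move towards it as the performance approaches 1.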

The addition of a time dimension to Figure 8 generates a three-dimensional object. 
It resembles a funnel or trumpet that has a wide circular opening on the upper end 
and narrows towards the origin. Figure 9 depicts this "informational universe," which 
follows directly from the two-dimensional graphic in Figure 8 when I use the performance 
values as time coordinates for the third axis. Note that the use of performance as time is 
reasonable because the exponential distribution is typically used to model dynamic time 
processes and the expectation value is thus typically associated with time. This will also 
be an important point in the next section. 
Figure 9: Informational Universe.



8 Relativity 

This section discusses the net force in a wider context and from a physical point of view.
I begin by revisiting the net force as introduced in (15):

(28)

The net force describes the net effect of the two forces defined in (12) and (13), respectively.
As I showed above, each force entails its own interpretation of the performance
function p(K). However, the net effect of both forces in (28), which is simply
the difference between both forces, provides no information about the interpretation
of p(K). Both interpretations, i.e. the exponential distribution or its complement, are
valid performances. In fact, the interpretation we use depends on our viewpoint and just
changes the sign of the net force in (28). The previous result in (16) shows that the
sigmoid function provides the correct performance values once we have chosen our point
of view. Accordingly, the performance will lie between 0 and 0.5 for a negative net force
and between 0.5 and 1 for a positive net force. The fact that there is no objectively
correct viewpoint strongly resembles the principle of relativity, which plays a major role
in physics.

Motivated by the general entropy introduced at the beginning of this paper, I will now
derive another interesting result relating to relativity. As I have introduced in Section 2,
the general entropy is based on summands having the following form:

(29)

We can easily see that each summand matches the definition of Force B introduced in (13)
when the expectation equals the performance. For this reason, I consider Force B, or
rather general entropy, to be the more fundamental of the two forces, Force A and Force B.
Actually, I understand that the difference between Force A and Force B, i.e. the net 
force, describes merely our perception, while the general entropy defines the true uncer- 
tainty. The sigmoid function will thus provide the real performance values, allowing us 
to compute the actual general entropy. Spinning this thought further, I understand that 
we perceive reality in points defined by the golden ratio. Our perception will be different 
from reality except for performance values equal to the golden ratio. Let me present an 
interesting physical application of this idea: In physics, a typical performance function 
could be the velocity v of an object in relation to light speed c. This value should always
lie within the range from 0 to 1 because the current state of the art assumes that no
object can move faster than the speed of light. If we insert this relative speed into (19),
which describes the relationship defining the golden ratio, we obtain the following result:

(30)

The expression on the left-hand side is the well-known Lorentz factor, or rather the
inverse Lorentz factor, which plays a crucial part in Einstein's special relativity. The
Lorentz factor describes how mass, length, and time change for an object or system
whose velocity approaches light speed. For a moving object, an observer will measure a
shorter length, more mass, and a shorter time lapse between two events. These effects
become more pronounced as the moving object approaches the speed of light. Depending
on the relative speed to light, the Lorentz factor describes basically the ratio between the
quantity measured for the observer and the quantity measured for the moving system.
For instance, if t is the time measured locally by the observer, then the corresponding
time t' measured for the moving system computes as follows:

$$ t' = t \sqrt{1 - \frac{v^2}{c^2}} \qquad (31) $$

We can see that t' converges to zero for increasing speed, i.e. we can measure no time lapse
for a system moving with light speed. Similar relationships hold for length and mass.
However, time dilation is especially interesting because the exponential distribution is
very often used to model the time between statistical events that happen at a constant
average rate, such as radioactive decay or the time until the next system failure, as already
mentioned in the previous section. The expectation value of the exponential distribution
is then indeed time, namely the expected time until the next event. In this context,
an expectation value in the form of the Lorentz factor makes perfect sense. Actually,
time dilation then follows from the relationship in (30). However, according to
the perceptual model introduced above, I understand that time dilation is merely our
perception and does not reflect reality. The true performance follows when we use our
observed performance as input to the sigmoid function, which then provides the actual
performance. For instance, an expectation value corresponding to a Lorentz factor with
a relative speed (performance) of 0.5 leads to an observed performance of according 
to (30). Insertion of this observed performance into the sigmoid function leads to the 
following result: 

^ 1 a; 0.586 (32) 

Note that this result is slightly larger than 0.5.
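The behavior of the inverse Lorentz factor discussed in this section can be illustrated with a short computation. This is a minimal sketch that assumes the standard time-dilation relation t' = t sqrt(1 - v²/c²), with the relative speed v/c playing the role of the performance; the function names are hypothetical.

```python
import math

def inverse_lorentz_factor(relative_speed):
    """sqrt(1 - (v/c)^2) for a relative speed v/c in [0, 1]."""
    if not 0.0 <= relative_speed <= 1.0:
        raise ValueError("relative speed v/c must lie in [0, 1]")
    return math.sqrt(1.0 - relative_speed ** 2)

def dilated_time(observer_time, relative_speed):
    """Time lapse t' for the moving system, given the observer's time t."""
    return observer_time * inverse_lorentz_factor(relative_speed)

# One unit of observer time at increasing relative speeds v/c.
speeds = (0.0, 0.5, 0.9, 0.99, 1.0)
lapses = [dilated_time(1.0, beta) for beta in speeds]
```

The computed lapses shrink monotonically and reach zero at v/c = 1, in line with the statement that no time lapse can be measured for a system moving with light speed.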

This concludes my theoretical foray into the field of physics. We know from prac- 
tical experiments that the observation of a physical experiment can actually change its 
outcome. The classic example for this fact is the famous double slit experiment [21]. 
For this reason, some physicists have already suggested that they might have to include 
human perception into their models in order to develop a more complete and thus more 
powerful theory that can describe these effects. It remains to be seen to what extent the 
proposed perceptive model turns out to be useful in this respect. 



9 Informational Intelligence 

In this section, I am going to apply the concept of informational confidence to a practical
problem. In order to do so, I divide this section into three subsections: In the first
subsection, I show how to learn informational confidence values by estimating the necessary
parameters on an evaluation set. In the second subsection, I present practical recognition
rates of a multiple classifier system for handwritten Japanese character recognition. In
the third subsection, I propose a new framework for machine learning in the form of a
network architecture that implements the ideas introduced above, in particular general
entropy. I therefore use the term "informational intelligence" as the title for this section
in order to convey the broader meaning of informational confidence.



9.1 Informational Confidence Learning 

In most practical cases, classifiers do not provide informational confidence values. Their 
confidence values typically violate the fixed point equation in the performance theorem, 
indicating a distorted equilibrium between information and confidence. Classifier com- 
bination therefore calls for a second training process in addition to the classifier-specific 
training methods teaching each classifier the decision boundaries of each class. Accordingly,
I consider learning of informational confidence values to be a three-step process: In
the first step, I train a classifier with its specific training method and training set. In
the second step, I estimate the performance for each confidence value on an evaluation
set. Finally, I compute new informational confidence values by inserting the performance
estimates into the fixed point equation of the performance theorem. The newly computed
informational confidence values are stored in a look-up table and will replace the
original raw confidence values in all future classifications. The fixed point equation of
the performance theorem then reads as follows:

$$ K_i^{new} = -E \ln\left(1 - p(K_i^{old})\right), \qquad (33) $$

where $p(K_i^{old})$ is the performance estimate of each raw confidence value $K_i^{old}$, E is the
expectation estimate, and $K_i^{new}$ is the new informational confidence value subsequently
replacing $K_i^{old}$.
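The third step, building the look-up table from the fixed point equation (33), can be sketched as follows. The performance values and the expectation estimate below are hypothetical placeholders, not measurements from the paper, and the function name is an assumption made for illustration.

```python
import math

def informational_confidence(performance, expectation):
    """K_new = -E * ln(1 - p(K_old)), following the fixed point equation (33)."""
    if not 0.0 <= performance < 1.0:
        raise ValueError("performance must lie in [0, 1)")
    return -expectation * math.log(1.0 - performance)

# Hypothetical evaluation results: raw confidence value -> estimated performance.
performance_by_raw_confidence = {0.2: 0.35, 0.5: 0.60, 0.9: 0.95}
expectation = 0.9  # hypothetical expectation estimate E

# Look-up table that replaces raw confidence values in later classifications.
lookup_table = {
    raw: informational_confidence(p, expectation)
    for raw, p in performance_by_raw_confidence.items()
}
```

Because the mapping is monotonically increasing in the performance estimate, the table preserves the order of the raw confidence values whenever the performance estimates do.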

In the following, I show how I compute the estimates E and $p(K_i^{old})$ on the evaluation
set [3, 4, 6].

9.1.1 Expectation Estimate E 

For the practical experiments in the next subsection, the classifier's global recognition
rate R on the evaluation set will serve as the expectation estimate E. I additionally
normalize the recognition rate R according to the overall information I(C) provided by
classifier C. Following the computation of information for confidence values, I(1 - p(K)),
I estimate I(C) using the performance complement [3, 4]:

$$ I(C) = -\ln(1 - R) \qquad (34) $$

Based on the estimate I(C), E computes as $\sqrt[I(C)]{R}$, which maps the global recognition
rate R to its normalized rate for a one-bit classifier. The fixed point equation in the
performance theorem now reads as follows:

$$ K_i^{new} = -\sqrt[I(C)]{R}\, \ln\left(1 - p(K_i^{old})\right) \qquad (35) $$

This leaves us with the performance estimate as the only missing parameter to com- 
pute informational confidence values. 
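The expectation estimate can be illustrated with a few lines of code. This sketch assumes the reading E = R^(1/I(C)), i.e. the I(C)-th root of R, with I(C) = -ln(1 - R) as in (34). The function names are hypothetical, and the rate 0.8994 is the off-line 1-best rate from Table 1, used here only as an example input.

```python
import math

def classifier_information(recognition_rate):
    """I(C) = -ln(1 - R), as in (34)."""
    return -math.log(1.0 - recognition_rate)

def expectation_estimate(recognition_rate):
    """E as the I(C)-th root of R, i.e. R ** (1 / I(C))."""
    return recognition_rate ** (1.0 / classifier_information(recognition_rate))

R = 0.8994  # example input: off-line 1-best rate, written as a fraction
information = classifier_information(R)
E = expectation_estimate(R)

# A classifier with I(C) = 1 is left unchanged by the normalization.
R_unit = 1.0 - math.exp(-1.0)
E_unit = expectation_estimate(R_unit)
```

The second example shows the sense in which the root acts as a normalization: a recognition rate whose information I(C) already equals 1 maps to itself.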



9.1.2 Performance Estimate $p(K_i^{old})$

Motivated by the performance theorem, which states that the performance function fol- 
lows an exponential distribution, I propose an estimate that expresses performance as 
a percentage of the maximum performance possible. Accordingly, my relative perfor- 
mance estimate describes the different areas delimited by the confidence values under 
their common density function. Mathematically, the performance estimate is based on 
accumulated partial frequencies defined by the following formula [17, 18]: 

In this equation, N is the number of patterns contained in the evaluation set. The
helper function $n_{correct}(K_i^{old})$ returns the number of patterns correctly classified with
confidence $K_i^{old}$. The use of monotonically increasing frequencies guarantees that the
estimated informational confidence values will not affect the order of the original raw
confidence values:

For this reason, the performance estimate in (36) ensures that informational confidence
values have no effect on the recognition rate of a single classifier, except for ties introduced
by mapping two different confidence values to the same informational confidence value.
Ties can happen when two neighboring confidence values show the same performance and
become indistinguishable due to insufficient evaluation data. In most applications, this
should be no problem, though. Typically, the effect of informational confidence values
shows only when we combine several classifiers into a multiple classifier system, with all
classifiers learning their individual informational confidence values, unless we compute
class-specific informational confidence values.

Estimates based on accumulated partial frequencies act like a filter in that they do 
not consider single confidence values but a whole range of values. They average the 
estimation error over all confidence values in a confidence interval. This diminishes the 
negative effect of inaccurate measurements of the estimate $p(K_i^{old})$ in application domains
with insufficient or erroneous evaluation data. Furthermore, estimation of informational 
confidence values can be considered a warping process aligning the progression of confi- 
dence values with the progression of performance. For experiments with other possible 
performance estimates, readers are referred to the references [3, 4, 6]. 
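An estimate based on accumulated partial frequencies can be sketched as follows. The exact formula (36) is not legible in this copy, so the code assumes one plausible reading: the estimate for a confidence value is the number of correctly classified patterns with that confidence or a lower one, divided by the total number N of evaluation patterns. The function name and the toy evaluation set are hypothetical.

```python
from collections import Counter
from itertools import accumulate

def performance_estimates(samples):
    """Map each confidence value to its accumulated partial frequency.

    samples is a list of (confidence, is_correct) pairs from an evaluation set.
    """
    n_total = len(samples)
    n_correct = Counter(conf for conf, is_correct in samples if is_correct)
    levels = sorted({conf for conf, _ in samples})
    # Running total of correct classifications over increasing confidence.
    running = accumulate(n_correct[level] for level in levels)
    return {level: count / n_total for level, count in zip(levels, running)}

# Toy evaluation set: (raw confidence, classified correctly?).
evaluation = [
    (0.2, False), (0.2, True),
    (0.5, True), (0.5, False),
    (0.8, True), (0.8, True),
    (0.9, True), (0.9, True),
]
estimates = performance_estimates(evaluation)
```

Since the running total never decreases, the estimates are monotonically increasing in the confidence value, which is the property the text relies on to preserve the ranking of a single classifier.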

After normalization of the performance estimate $p(K_i^{old})$ to a one-bit classifier, as I
already did for the expectation estimate, the final version of the fixed point equation in 
the performance theorem reads as follows: 



Note that the newly computed informational confidence values $K_i^{new}$ are an attractor of
this fixed point equation. In other words, the fixed point will be reached after exactly one
iteration of the training procedure, or rather estimation process. All additional iterations
will produce exactly the same confidence values; i.e., $K_i^{new} = K_i^{old}$.

9.2 Practical Experiments 

In this mainly theoretical paper, I confine myself to practical experiments for a multiple 
classifier system developed to recognize handwritten Japanese characters. Readers will 
find more information in the references, including other experiments with informational 
confidence values for document processing applications [3, 4, 6]. Handwriting recognition
is a very promising application field for classifier combination. Multiple classifier
systems therefore have a long tradition in handwriting recognition [22, 20]. In particular,
the duality of handwriting recognition, with its two branches off-line recognition
and on-line recognition, makes it suitable for multiple classifier systems. While off-line 
classifiers process static images of handwritten words, on-line classifiers operate on the 
dynamic data and expect point sequences over time as input signals. Compared to the 
time-independent off-line representations used by off-line classifiers, on-line classifiers suf- 
fer from stroke-order and stroke-number variations inherent in human handwriting and 
thus in on-line data. On the other hand, on-line classifiers are able to exploit the dy- 
namic information and can very often discriminate between classes with higher accuracy. 
Off-line and on-line classifiers thus complement each other, and their combination can 
overcome the problem of stroke-order and stroke-number variations. This is especially 
important in Japanese and Chinese character recognition because the average number 
of strokes per character, and thus the number of variations, is much higher than in the 
Latin alphabet [5, 9]. 
Japanese    offline    online    AND      OR
1-best      89.94      81.04     75.41    95.56
2-best      94.54      85.64     82.62    97.55
3-best      95.75      87.30     84.99    98.06

Table 1: Single n-best rates for handwritten Japanese character recognition.

For my experiments, I use a multiple classifier system comprising two classifiers for 
on-line handwritten Japanese characters. Both classifiers are nearest neighbor classifiers. 
One of these two classifiers, however, transforms the captured on-line data into an off-line 
pictorial representation before applying the actual classification engine. This transfor- 
mation happens in a pre-processing step and connects neighboring on-line points using 
a sophisticated painting method [19, 7]. We can therefore consider this classifier to be 
an off-line classifier. As mentioned above, learning of informational confidence values is 
a three-step process: First, each classifier is trained with its standard training method 
on a given training set. Then, I compute the performance of each confidence value for 
each classifier on an evaluation set, using the performance estimate in (36). In the last 
step, I estimate the informational confidence values based on the estimate given in (38). 
These estimates will then replace the original confidence values in all future classifications 
of each classifier. In my experiments, each classifier was initially trained on a training 
set containing more than one million handwritten Japanese characters. The test and
evaluation set contains 54,775 handwritten characters. From this set, I take about two
thirds of the samples to estimate the performances of confidence values and one third
to compute the final recognition performance of the estimated informational confidence 
values. For more information about the classifiers and data sets used, I refer readers to 
the references [7, 11, 12]. 

Table 1 lists the individual recognition rates for the off-line and on-line classifier. It 
shows the probabilities that the correct class label is among the n-best alternatives having 
the highest confidence, with n = 1, 2, 3. The off-line recognition rates are much higher
than the corresponding on-line rates. Clearly, stroke-order and stroke-number variations
are largely responsible for this performance difference. They considerably complicate the
classification task for the on-line classifier. The last two columns of Table 1 show the
percentage of test patterns for which the correct class label occurs either twice (AND) or 
at least once (OR) in the n-best lists of both classifiers. The relatively large gap between 
the off-line recognition rates and the numbers in the OR-column suggests that on-line 
information is indeed complementary and useful for classifier combination. 
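The four columns of Table 1 can be computed from ranked candidate lists as in the following sketch. The function name and the toy data are hypothetical; the sketch only illustrates how the AND and OR columns relate to the individual n-best rates.

```python
def n_best_rates(truth, offline_ranked, online_ranked, n):
    """Percentage of patterns whose correct label is among the n best candidates.

    Returns rates for the off-line classifier, the on-line classifier,
    both classifiers (AND), and at least one classifier (OR).
    """
    total = len(truth)
    counts = {"offline": 0, "online": 0, "AND": 0, "OR": 0}
    for label, off_list, on_list in zip(truth, offline_ranked, online_ranked):
        in_off = label in off_list[:n]
        in_on = label in on_list[:n]
        counts["offline"] += in_off
        counts["online"] += in_on
        counts["AND"] += in_off and in_on
        counts["OR"] += in_off or in_on
    return {name: 100.0 * count / total for name, count in counts.items()}

# Toy data: correct labels and ranked candidate lists (best candidate first).
truth = ["a", "b", "c", "d"]
offline_ranked = [["a", "x"], ["b", "y"], ["x", "c"], ["x", "y"]]
online_ranked = [["a", "y"], ["x", "b"], ["c", "x"], ["d", "x"]]

one_best = n_best_rates(truth, offline_ranked, online_ranked, n=1)
two_best = n_best_rates(truth, offline_ranked, online_ranked, n=2)
```

The OR rate is an upper bound for any combination scheme that selects among the candidates of both classifiers, which is why the gap between the best single rate and the OR column indicates how much a combination can gain.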

Table 2 shows the recognition rates for combined off-line/on-line recognition, using 
sum-rule, max-rule, and product-rule as combination schemes. Sum-rule adds the confi- 
dence values provided by each classifier for the same class, while product-rule multiplies 
the confidence values. Max-rule simply takes the maximum confidence without any fur- 
ther operation. The class with the maximum overall confidence will then be chosen as 
the most likely class for the given test pattern. Note that sum-rule is the mathematically 
appropriate combination scheme for integration of information from different sources [15]. 
Japanese (89.94)    Raw Confidence    Inf. Confidence
Sum-rule            93.25             93.78
Max-rule            91.30             91.14
Product-rule        92.98             65.16

Table 2: Combined recognition rates for handwritten Japanese character recognition.

In addition, sum-rule is robust against noise, as was shown in [8]. The upper left cell 
of Table 2 lists again the best single recognition rate from Table 1, achieved by the 
off-line recognizer. The second column contains the combined recognition rates for the 
raw confidence values as provided directly by the classifiers, while the third column lists 
the recognition rates for informational confidence values computed according to (38). 
Compared to the individual rates, the combined recognition rates in Table 2 are clear
improvements, with the exception of the product-rule on informational confidence values
(65.16%). The sum-rule on raw confidence values already accounts for an improvement
of about 3.3 percentage points. The best combined recognition rate achieved with normalized
informational confidence is 93.78%. It outperforms the off-line classifier, which is the
best individual classifier, by about 3.8 percentage points. Sum-rule performs better than
max-rule and product-rule, a fact in accordance with the results in [8].
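The three combination schemes can be written down compactly. The following sketch is illustrative only: the function name and the confidence values are hypothetical, and a class missing from a classifier's output is assumed to have confidence 0.

```python
import math

def combine(confidences_per_classifier, rule):
    """Combine per-class confidence values from several classifiers.

    confidences_per_classifier is a list of dicts mapping class label to confidence.
    Returns the winning class and the combined score of every class.
    """
    reducers = {"sum": sum, "max": max, "product": math.prod}
    reduce_scores = reducers[rule]
    classes = set().union(*confidences_per_classifier)
    scores = {
        label: reduce_scores([conf.get(label, 0.0) for conf in confidences_per_classifier])
        for label in classes
    }
    return max(scores, key=scores.get), scores

# Hypothetical confidence values from an off-line and an on-line classifier.
offline = {"A": 0.9, "B": 0.5}
online = {"A": 0.1, "B": 0.45}

sum_winner, _ = combine([offline, online], "sum")
max_winner, _ = combine([offline, online], "max")
product_winner, _ = combine([offline, online], "product")
```

In this toy case the product-rule picks a different class than the other two rules, because a single low confidence value dominates a product. This sensitivity is one possible reason for the weak product-rule result on informational confidence values in Table 2, although the paper does not analyze it.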

9.3 Neural Network Architecture 

At the end of this paper, I am going to show how the results introduced above can be 
combined to form a network architecture for complex decision problems. The architecture 
I propose is similar to the well-known feedforward type of artificial neural networks in that 
a neuron first integrates its inputs and then applies a sigmoid function to compute the 
final output, which it propagates to the synapses of other neurons. The main motivation
for the sigmoid function, however, derives from an information-theoretical background,
as discussed in Section 5. Figure 10 shows the basic unit of the proposed "information
network": a neuron and its synapses. The basic idea is that each synapse computes
one summand of the general entropy defined in (3) of Section 2. The main body of 
the neuron first integrates all these summands, computing the general entropy according 
to (1) and (3). The sigmoid function then computes the actual performance based on the 
general entropy. Finally, the neuron forwards the newly computed performance to other 
neurons, which in turn repeat the same process. In this way, complex decisions become 
aggregates of simpler decisions. 
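The forward pass of such a unit can be sketched in a few lines. The sketch makes two assumptions that are not spelled out in this part of the paper: each synapse computes a summand of the form -p ln(1 - p), and the neuron applies the standard logistic function 1 / (1 + e^(-x)) to the summed entropy. All names are hypothetical.

```python
import math

def synapse(p):
    """Assumed entropy summand for an input performance p in [0, 1): -p * ln(1 - p)."""
    return -p * math.log(1.0 - p)

def neuron(inputs):
    """Integrate the synapse summands and apply the sigmoid function."""
    entropy = sum(synapse(p) for p in inputs)
    return 1.0 / (1.0 + math.exp(-entropy))

# A single synapse with input performance 0.5.
single = neuron([0.5])

# Two neurons feeding a third one, so that a decision aggregates simpler decisions.
hidden = [neuron([0.5, 0.8]), neuron([0.3])]
network_output = neuron(hidden)
```

Under these assumptions a single input of 0.5 yields an output of about 0.586, which agrees with the value reported in (32). Since every summand is non-negative, all outputs lie between 0.5 and 1 and can therefore be fed into the next layer without leaving the domain of the logarithm.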

Figure 10: Information Network.

Similar to the training process in feedforward networks, the backpropagation of feedback
trains the network in Figure 10. Instead of the gradient descent in parameter space
that is typically implemented in feedforward networks, backpropagation for the network
in Figure 10 means basically propagating the performance back so that each neuron
can adjust its output. The performance can be directly inserted as part of the sigmoid
function in (16). For instance, insertion of the performance values defined in (36) leads
to the following expression for the output values, after additionally normalizing each
performance value to one bit:

(39) 



1 + ^(^^p(Xf'i) 



In my experiments, a simple summation of the information provided by each output value,
or rather classifier, for each class provides a recognition rate of 93.92% for the handwritten
character recognition problem. This is better than the best recognition rate in Table 2.

I hope to be able to support the proposed network architecture with additional ex- 
periments in other application domains, and by implementing a full-fledged network and 
not just a single layer. 



10 Summary 

I introduced a new form of entropy that can be considered an extension of the classic 
entropy introduced by Shannon. Each summand of this entropy is a fixed point equa- 
tion in which the so-called performance function takes over the part of the probability. 
However, the performance function plays several roles in my approach: It describes the 
distribution of an exponentially distributed random variable, and is also an expectation 
value in the statistical sense. Furthermore, with the exponential distribution typically 
used to describe statistical time processes, there is also a point in favor of it being time. 
The performance theorem in the first part of the paper summarizes these relationships 
and provides guidelines for learning informational confidence values for classifier com- 
bination. In my first practical results published in [3, 4, 6], I improved the recognition 
rates for several multiple classifier systems. In the present paper, I confined myself to 
the recognition rates for handwritten Japanese character recognition and concentrated 
on theoretical issues. I showed how to produce a symbol similar to the famous Yin/Yang
symbol by depicting the net confidence as a spiral. The net confidence is the difference
between the confidence and counter-confidence, with the latter being based on the performance
complement. My understanding is that our perception is always the composite
of Yin and Yang and does not reflect the reality, except when the performance function 
equals the golden ratio. I thus assign an information-theoretical meaning to the golden 
ratio. Moreover, I understand that the sigmoid function provides the actual performance 
value that we cannot observe directly. Under these observations and assumptions, I can 
explain the time dilation of Einstein's Special Relativity. However, it follows that time 
dilation is mere perception and does not correspond to reality. At the end of the paper, 
I proposed a network architecture for complex decisions, which takes advantage of the 
general entropy concept. I hope that the usefulness of this architecture can be confirmed 
by future experiments in different application fields. 